Abstract

Recently, Hajirasouliha and Raphael (WABI 2014) proposed a model for deconvoluting mixed
tumor samples measured from a collection of high-throughput sequencing reads. This is related
to understanding tumor evolution and critical cancer mutations. In short, their formulation
asks to split each row of a binary matrix so that the resulting matrix corresponds to a
perfect phylogeny and has the minimum number of rows among all matrices with this property.
In this paper we disprove several claims about this problem, including an NP-hardness
proof of it. However, we show that the problem is indeed NP-hard, by providing a different
proof. We also prove NP-completeness of a variant of this problem proposed in the same paper.

On the positive side, we propose an efficient (though not necessarily optimal) heuristic algorithm
based on coloring co-comparability graphs, and a polynomial time algorithm for solving
the problem optimally on matrix instances in which no column is contained in both columns
of a pair of conflicting columns. Implementations of these algorithms are freely available at
https://github.com/alexandrutomescu/MixedPerfectPhylogeny.

1 Introduction


Tumor progression is assumed to follow a phylogenetic evolution in which each tumor cell passes its
somatic mutations to its daughter cells as it divides, with new mutations being accumulated over
time. It is important to discover what tumor types are present in the sample, what evolutionary
stage the tumor is in, or what the “founder” mutations of the tumor are, that is, the mutations that
trigger an uncontrollable growth of the tumor. These can lead to a better understanding of cancer [2, 23],
better diagnosis, and more targeted therapies.


DNA sequencing is one method for discovering the somatic mutations present in each tumor 
sample. The most accurate possible observation would come from sampling and sequencing every 
single cell. However, because of single-cell sequencing limitations, and the sheer number of tumor 
cells, one usually samples populations of cells. Even though the samples are taken spatially and 
morphologically apart, they can still contain millions of different cancer cells. Moreover, this 
mixing is not consistent across different collections of samples. Therefore, studying only these 
mixed samples poses a serious challenge to understanding tumors, their evolution, or their founding 
mutations. 

Solutions for overcoming this limitation can come from a computational approach, as one could
deconvolute each sample by exploiting some properties of the tumor progression. One common
assumption is that all mutations in the parent cells are passed to the descendants. Another one,
called the “infinite sites assumption”, postulates that once a mutation occurs at a particular site, it
does not occur again at that site. These two assumptions give rise to the so-called perfect phylogeny
evolutionary model. Hajirasouliha and Raphael proposed in [10] a model for deconvoluting each
sample into a set of tumor types so that the multiset of all resulting tumor types forms a perfect
phylogeny, and is minimum with this property. Even though this model has some limitations (for
example, it assumes no errors and only single nucleotide variant mutations), it is a fundamental
problem whose understanding can lead to more practical extensions.

Other major approaches for deconvoluting tumor heterogeneity include methods based on somatic
point mutations, such as PyClone, SciClone, PhyloSub [15], CITUP and LICHeE [26],
and methods based on somatic copy number alterations, such as THetA [25], TITAN and MixClone.

Let us review two methods from the first category mentioned above. CITUP exhaustively
enumerates all possible phylogenetic trees (up to a maximum number of vertices) and tries
to decompose each sample into several nodes of the phylogeny. The fit between each sample and
the phylogenetic tree is one minimizing a Bayesian information criterion on the frequencies of each
mutation. This is computed either exactly, with quadratic integer programming, or with a heuristic
iterative method. The tree achieving an optimal fit is output, together with the decompositions of
each sample as nodes (i.e., sets of mutations) of this tree.

The LICHeE method also tries to fit the observed mutation frequencies to an optimal phylogenetic
tree, but with an optimized search for such a tree. Mutations are first assigned to clusters
based on their frequencies (a mutation can belong to several clusters). These clusters form the
nodes of a directed acyclic graph (DAG). Directed edges are added to this graph from a node to
all its possible descendants, based on inclusions among their mutation sets and on compatibility
among their observed frequencies. Spanning trees of this DAG are enumerated, and the ones most
compatible with the mutation frequencies are output.

As opposed to the problem proposed in [10] and considered in this paper, these two methods
rely heavily on the mutation frequencies. Frequencies appear at the core of their problem
formulations, and without them they would probably output an arbitrary phylogenetic tree and
a decomposition of samples into nodes of this tree compatible with the observed data. The two
problems in this paper have the same output, but only assume data on the absence or presence of
mutations in each sample. The objective function of our first problem requires that the sum, over all
samples, of the number of leaves of the phylogeny that the sample is decomposed into, is minimum.
The second problem formulation only requires that the output phylogeny has the minimum number of
leaves.

In this paper we show that several proofs from [10] related to the optimization problem proposed
therein are incorrect, including an NP-hardness proof of it.¹ However, the NP-hardness claim turns
out to be correct; in this paper we provide a different NP-hardness proof. We also adapt this proof
to a variant of the problem also proposed in [10] but whose complexity was left open. This problem
asks to minimize the set (instead of multi-set) of all tumor types of the perfect phylogeny. The two
problems, formally defined in Section 2, are referred to as the Minimum Conflict-Free Row
Split and the Minimum Distinct Conflict-Free Row Split problem, respectively.

Moreover, we obtain a polynomial time algorithm for a collection of instances of the Minimum 
Conflict-Free Row Split problem, which can be biologically characterized as follows. Say that 
two mutations i and j are exclusive if i is present in a sample in which j is absent, and j is present 
in a sample in which i is absent. Observe that exclusive mutations cannot both be present in the 
same vertex of a perfect phylogeny. Thus, we say that a sample is a mixture at exclusive mutations 
i and j if both i and j are present in that sample. The instances for which we can solve the problem 
in polynomial time are such that for any two exclusive mutations i and j, no mutation is present 
only in the samples mixed at i and j. 

We also propose an efficient (though not necessarily optimal) heuristic algorithm for the
Minimum Conflict-Free Row Split problem, based on coloring co-comparability graphs,
and provide implementations of both algorithms, freely available at
https://github.com/alexandrutomescu/MixedPerfectPhylogeny.


Paper outline. In Section 2 we give all formal definitions and review the approach of [10]. In
Section 3 we give a complete characterization of the so-called row-conflict graphs, the class of graphs
considered in [10]. The complexity results are presented in Section 4, and the above-mentioned
polynomial time algorithm is given in Section 5. In Section 6 we discuss the heuristic algorithm for
general instances, and in Sections 7 and 8 we present experimental results on the binary matrices
from clear cell renal cell carcinomas (ccRCC). We conclude the paper with a discussion in
Section 9.

Some of the results in this paper appeared in the proceedings of WABI 2015. In addition
to the material presented there, this paper contains all the missing proofs (complete proofs of
Theorems 2, 3, and 4, a more detailed proof of Lemma 2), a time complexity analysis of the
algorithm presented in Section 5, a discussion following the proof of Theorem 5 on the necessity of
the assumptions for the algorithm given in Section 5, and three additional sections (Sections 6, 7
and 8) describing a polynomial time heuristic for general instances and experimental results.


¹ Hajirasouliha and Raphael mentioned during their WABI 2014 talk that their claim about every graph being
a row-conflict graph (Theorem 4 in [10]) contained a flaw and proposed a correction stating that for every binary
matrix M with an all-zeros row and an all-ones row, the complement of $G_{M,r}$ (for any row r of M) is transitively
orientable (cf. Section 2 for the definition of $G_{M,r}$ and Theorem 2 below). In particular, the fact that Theorem 4
in [10] is incorrect implies that the NP-hardness proof from [10] is incorrect.

2 Problem formulation


As mentioned in the introduction, we assume that we have a set of sequencing reads from each
tumor sample, and that based on these reads we have discovered the sample variants with respect
to a reference (e.g., by using a somatic mutation caller such as VarScan 2). This gives rise to
an $m \times n$ matrix M whose m rows are the different samples, and whose n columns are the genome
loci where a mutation was observed with respect to the reference. The entries of M are either 0 or
1, with 0 indicating the absence of a mutation, and 1 indicating the presence of the mutation. We
assume that the matrix has no row whose entries are all 0.

Under ideal conditions, e.g., each mutation was called without errors, and the samples do not 
contain reads from several leaves of the perfect phylogeny, M corresponds to a perfect phylogeny 
matrix. Such matrices are characterizable by a simple property, called conflict-freeness. 

Definition 1. Two columns i and j of a binary matrix M are said to be in conflict if there exist
three rows r, r', r'' of M such that $M_{r,i} = M_{r,j} = 1$, $M_{r',i} = M_{r'',j} = 0$, and
$M_{r',j} = M_{r'',i} = 1$. A binary matrix M is said to be conflict-free if no two columns of M
are in conflict.
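Definition 1 can be checked directly by looking at the value pairs that two columns take across the rows. The following Python sketch is purely illustrative: the function names `in_conflict` and `is_conflict_free` and the list-of-lists matrix representation are assumptions made here, not part of the implementation accompanying the paper.

```python
from itertools import combinations

def in_conflict(M, i, j):
    """Return True if columns i and j of the binary matrix M are in conflict.

    Columns i and j are in conflict when some row holds (1, 1), some row
    holds (0, 1), and some row holds (1, 0) in positions (i, j).
    """
    patterns = {(row[i], row[j]) for row in M}
    return {(1, 1), (0, 1), (1, 0)} <= patterns

def is_conflict_free(M):
    """Return True if no two columns of M are in conflict."""
    n = len(M[0]) if M else 0
    return not any(in_conflict(M, i, j) for i, j in combinations(range(n), 2))

# The first row mixes two mutations that also occur separately,
# which puts the two columns in conflict.
mixed = [[1, 1],
         [1, 0],
         [0, 1]]
print(is_conflict_free(mixed))             # False
print(is_conflict_free([[1, 1], [1, 0]]))  # True
```

The check takes time quadratic in the number of columns, which is enough for illustration; it is not the linear-time method mentioned in the text.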


It is well known that the rows of M are leaves of a perfect phylogenetic tree if and only if M is
conflict-free. Moreover, if this is the case, then the corresponding phylogenetic tree can
be retrieved from M in time linear in the size of M.

However, in practice, each tumor sample is a mixture of reads from several tumor types, and
thus M is possibly not conflict-free. For the setting in which we are not allowed to edit the entries
of M, as done e.g. by methods such as [29, 28], Hajirasouliha and Raphael proposed in [10] to turn M
into a conflict-free matrix M' by splitting each row r of M into some rows $r_1, \dots, r_k$ such that r
is the bitwise OR of $r_1, \dots, r_k$, that is, for every column c, $M_{r,c} = 1$ if and only if
$M'_{r_i,c} = 1$ for at least one $r_i$. The rows $r_1, \dots, r_k$ can be seen as the deconvolution of the
mixed sample r into samples from single vertices of a perfect phylogeny. One can then build the perfect
phylogeny corresponding to M' and carry out further downstream analysis. Let us make this row split
operation precise.


Definition 2. Given a binary matrix $M \in \{0,1\}^{m \times n}$ with rows labeled $r_1, r_2, \dots, r_m$, we say that
a binary matrix $M' \in \{0,1\}^{m' \times n}$ is a row split of M if there exists a partition of the set of rows of
M' into m sets $R'_1, R'_2, \dots, R'_m$ such that for all $i \in \{1, 2, \dots, m\}$, $r_i$ is the bitwise OR of the binary
vectors given by the rows of $R'_i$. The set $R'_i$ of rows of M' is said to be a set of split rows of row $r_i$.


Observe that a simple strategy for obtaining a conflict-free row split of M is to split every row r
into as many rows as there are 1s in r, with a single 1 per row. While this might be an informative
solution for some instances (cf. also Corollary on p. 14), Hajirasouliha and Raphael proposed
in [10] as criterion for obtaining a meaningful conflict-free row split M' the requirement that the
number of rows of M' is minimum among all conflict-free row splits of M.
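To make Definition 2 and the simple splitting strategy concrete, here is a small Python sketch. The names `is_row_split` and `trivial_split`, and the representation of the partition as a list `parts` of row-index lists, are illustrative assumptions; the sketch also relies on the standing assumption that M has no all-zero row.

```python
def is_row_split(M, M_split, parts):
    """Check Definition 2: parts[i] lists the rows of M_split that replace
    row i of M, and their bitwise OR must equal that row."""
    used = sorted(idx for part in parts for idx in part)
    if len(parts) != len(M) or used != list(range(len(M_split))):
        return False  # the sets must partition the rows of M_split
    for row, part in zip(M, parts):
        if not part:
            return False  # every set of a partition is non-empty
        ored = [max(M_split[idx][c] for idx in part) for c in range(len(row))]
        if ored != list(row):
            return False
    return True

def trivial_split(M):
    """Split every row into one row per 1-entry, with a single 1 per row."""
    M_split, parts = [], []
    for row in M:
        part = []
        for c, value in enumerate(row):
            if value == 1:
                part.append(len(M_split))
                M_split.append([int(k == c) for k in range(len(row))])
        parts.append(part)
    return M_split, parts

M = [[1, 1, 0],
     [0, 1, 1]]
M_split, parts = trivial_split(M)
print(is_row_split(M, M_split, parts))  # True
print(len(M_split))                     # 4
```

Since every row of the trivial split contains a single 1, no two columns ever share a row with value 1, so the result is conflict-free; the price is that it uses as many rows as M has 1-entries.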

In this paper we consider the following problem, which we call the Minimum Conflict-Free
Row Split problem. For a binary matrix M, we denote by $\overline{\gamma}(M)$ the minimum number of rows
in a conflict-free row split M' of M. This notation is in line with the notation $\gamma(M)$ used in [10]
to denote the minimum number of additional rows in a conflict-free row split M' of M, that is,
$\gamma(M) = \overline{\gamma}(M) - m$, where m is the number of rows of M.

Minimum Conflict-Free Row Split:
Input: A binary matrix M, an integer k.
Question: Is it true that $\overline{\gamma}(M) \le k$?


The optimization version of the above problem (in which only a given subset of rows needs to
be split) was called the Minimum-Split-Row problem in [10]; however, all results from [10] deal
with the variant of the problem in which all rows need to be split (some perhaps trivially by setting
$R'_i = \{r_i\}$), which is equivalent to the Minimum Conflict-Free Row Split problem.

Given a binary matrix M and a row r of M, the conflict graph of (M, r) is the graph $G_{M,r}$
defined as follows: with each entry 1 in r, we associate a vertex in $G_{M,r}$, and two vertices in $G_{M,r}$
are connected by an edge if and only if the corresponding columns in M are in conflict. Denoting by
$\chi(G)$ the chromatic number of a graph G, Hajirasouliha and Raphael proved in [10] the following
lower bound on the value of $\overline{\gamma}(M)$.


Lemma 1 ([10]). Let M be a binary matrix with a conflict-free row split M'. Then, for every
row $r_i$ of M with a set $R'_i$ of split rows in M', we have $|R'_i| \ge \chi(G_{M,r_i})$.

Corollary 1. For every binary matrix M, we have $\overline{\gamma}(M) \ge \sum_r \chi(G_{M,r})$.
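The lower bound of Corollary 1 can be evaluated on small matrices as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and the chromatic number is found by exhaustive search over colorings, which is only feasible for tiny conflict graphs.

```python
from itertools import combinations, product

def in_conflict(M, i, j):
    """Columns i and j are in conflict if rows with (1,1), (0,1), (1,0) exist."""
    patterns = {(row[i], row[j]) for row in M}
    return {(1, 1), (0, 1), (1, 0)} <= patterns

def conflict_graph(M, r):
    """Vertices: columns with a 1 in row r; edges: pairs of conflicting columns."""
    vertices = [c for c, value in enumerate(M[r]) if value == 1]
    edges = [(i, j) for i, j in combinations(vertices, 2) if in_conflict(M, i, j)]
    return vertices, edges

def chromatic_number(vertices, edges):
    """Smallest k admitting a proper k-coloring, by brute force."""
    if not vertices:
        return 0
    for k in range(1, len(vertices) + 1):
        for colors in product(range(k), repeat=len(vertices)):
            coloring = dict(zip(vertices, colors))
            if all(coloring[u] != coloring[v] for u, v in edges):
                return k

def lower_bound(M):
    """Sum of chromatic numbers of the conflict graphs of all rows."""
    return sum(chromatic_number(*conflict_graph(M, r)) for r in range(len(M)))

M = [[1, 1],
     [1, 0],
     [0, 1]]
print(lower_bound(M))  # 4
```

In this example the bound is attained: splitting the first row into the rows (1, 0) and (0, 1) yields a conflict-free row split with four rows.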

Hajirasouliha and Raphael also claimed in [10] the following hardness result.

Theorem 1 ([10]). The Minimum Conflict-Free Row Split problem is NP-hard.

To recall their approach for proving Theorem 1, we need one more definition. We denote the
fact that two graphs G and H are isomorphic by $G \cong H$.

Definition 3. A graph G is a row-conflict graph if there exists a binary matrix M and a row r of
M such that $G \cong G_{M,r}$.

The proof of Theorem 1 was based on a reduction from the chromatic number problem in graphs
and relied on three ingredients: the lower bound given by Corollary 1, Theorem 4 from [10] stating
that every graph is a row-conflict graph, and an algorithm based on graph coloring, also proposed
in [10], for optimally solving the Minimum Conflict-Free Row Split problem by constructing
a conflict-free row split of M with exactly $\sum_r \chi(G_{M,r})$ rows. In particular, their results would
imply that the lower bound on $\overline{\gamma}(M)$ given by Corollary 1 is always attained with equality.

Contrary to what was claimed in [10], we show that there exist graphs that are not row-conflict
graphs. In fact, we give a complete characterization of row-conflict graphs, showing that a graph
is a row-conflict graph if and only if its complement is transitively orientable (see Theorem 2).
Using a reduction from 3-edge-colorability of cubic graphs, we show that it is NP-complete to test
whether a given binary matrix M has a conflict-free row split M' with number of rows achieving the lower
bound given by Corollary 1 (see Theorem 3). This implies that there exist infinitely many matrices for
which this bound is not achieved.

A corollary of our characterization of row-conflict graphs is that the chromatic number is
polynomially computable for this class of graphs. This fact together with the assumption that P ≠ NP, as well
as the existence of matrices M with $\overline{\gamma}(M) > \sum_r \chi(G_{M,r})$, each individually imply that the claimed
NP-hardness proof of the Minimum Conflict-Free Row Split problem given in [10] is flawed.
Nevertheless, our NP-completeness proof (see Theorem 3) implies that Theorem 1 is correct.

On the positive side, we give a polynomial time algorithm for the Minimum Conflict-Free
Row Split problem on input matrices M in which no column is contained in both columns of a
pair of conflicting columns (see Theorem 5).

We also consider a variant of the problem, also proposed in [10], in which we are only interested
in minimizing the number of distinct rows in a conflict-free row split of M. This problem is similar
to the Minimum Perfect Phylogeny Haplotyping problem, in which we need to explain a set
of genotypes with a minimum number of haplotypes admitting a perfect phylogeny. For a binary
matrix M, we denote by $\eta(M)$ the minimum number of distinct rows in a conflict-free row split
M' of M. We establish NP-completeness of the following problem (see Theorem 4), which was left
open in [10].


Minimum Distinct Conflict-Free Row Split:
Input: A binary matrix M, an integer k.
Question: Is it true that $\eta(M) \le k$?


3 A characterization of row-conflict graphs 

Definition 4. Given a binary matrix M and two columns i and j of M, column i is said to be
contained in column j if $M_{k,i} \le M_{k,j}$ holds for every k. The undirected containment graph $H_M$
is the undirected graph whose vertices correspond to the columns of M and in which two vertices i
and j, $i \ne j$, are adjacent if and only if the column corresponding to vertex i is contained in the
column corresponding to vertex j or vice versa.

Recall that an orientation of an undirected graph $G = (V, E)$ is a directed graph $D = (V, A)$
such that for every edge $uv \in E$, either $(u, v) \in A$ or $(v, u) \in A$, but not both. An orientation is
said to be transitive if the presence of the directed edges $(u, v)$ and $(v, w)$ implies the presence of the
directed edge $(u, w)$. A graph is said to be transitively orientable if it has a transitive orientation.
The complement of a graph G is the graph $\overline{G}$ with the same vertex set as G in which two distinct
vertices are adjacent if and only if they are non-adjacent in G. Transitively orientable graphs
appear in the literature under the name of comparability graphs (and their complements under
the name of co-comparability graphs). Transitively orientable graphs and their complements form
a subclass of the well known class of perfect graphs. Therefore, odd cycles of length at least 5
and their complements are examples of graphs that are not transitively orientable.

Observation 1. For every binary matrix M, the graph $H_M$ is transitively orientable.

Proof. We say that column i is properly contained in column j if i is contained in j and
$M_{k,i} < M_{k,j}$ for some k. Fix an ordering $(c_1, \dots, c_n)$ of the columns of M. Let us define a binary
relation $\sqsubset$ on the set of columns of M by setting, for every two columns $c_i$ and $c_j$ of M, $c_i \sqsubset c_j$ if and only
if either $c_i$ is properly contained in $c_j$, or $i < j$ and each of $c_i$ and $c_j$ is contained in the other one
(that is, as binary vectors they are the same). Observe that for a pair of columns $c_i$ and $c_j$ with
$c_ic_j \in E(H_M)$ we have either $c_i \sqsubset c_j$ or $c_j \sqsubset c_i$, but not both. The binary relation $\sqsubset$ defines an
orientation of $H_M$, by orienting each edge $c_ic_j$ as going from $c_i$ to $c_j$ if and only if $c_i \sqsubset c_j$. This
orientation can be easily verified to be transitive. □

In the next theorem, we characterize row-conflict graphs (cf. Definition 3).

Theorem 2. A graph G is a row-conflict graph if and only if $\overline{G}$ is transitively orientable.

Proof. (⇒) Let M be an arbitrary binary matrix, r an arbitrary row of M, and let $G = G_{M,r}$. Let
N be the submatrix of M consisting of the columns of M that have 1 in row r. It is now easy to see
that $G_{M,r} = G_{N,r}$. Moreover, any two columns of N are either in conflict or their corresponding
vertices are adjacent in $H_N$. Therefore, $H_N = \overline{G_{N,r}}$. Since $H_N$ is transitively orientable (by
Observation 1), it follows that $\overline{G}$ is transitively orientable as well.

(⇐) We follow the strategy of the proof of Theorem 4 in [10] (which works for complements
of transitively orientable graphs). For the sake of completeness, we include here a short proof of
this implication. Let G be a graph such that $H = \overline{G}$ is transitively orientable, with a transitive
orientation $\vec{H}$. It can be easily seen that $\vec{H}$ is acyclic, thus we may assume that the vertices of G are
topologically ordered as $V(G) = \{v_1, \dots, v_n\}$, that is, for every directed edge $(v_i, v_j)$ in $\vec{H}$, we have
$i < j$. Let $E(G) = \{e_1, e_2, \dots, e_m\}$. We construct a matrix M with n columns and $2m + 1$ rows,
such that $G_{M,1} \cong G$. The first row of M is defined to have all entries equal to 1. For every edge
$e_k = v_iv_j$, $i < j$, of G, the $2k$-th row of M has entry 0 in the column corresponding to vertex $v_i$,
and entry 1 in the column corresponding to $v_j$. Additionally, the $(2k + 1)$-st row of M has entry 1
in the column corresponding to vertex $v_i$, and entry 0 in the column corresponding to $v_j$. Since the
first row has all entries equal to 1, after filling in these entries of M, the two columns corresponding
to $v_i$ and $v_j$, respectively, are in conflict.

We need to fill in the remaining entries of M so that we do not introduce any new conflicts. For
every i, we fill in the remaining entries so that whenever $(v_i, v_j)$ is a directed edge in $\vec{H}$, the column
corresponding to the vertex $v_i$ is contained in the column corresponding to the vertex $v_j$. This can
be achieved by examining the columns one by one, following the topological order $(v_1, \dots, v_n)$ of $\vec{H}$,
and filling each unfilled entry with a 0, unless this would violate the above containment principle.

At the end of this procedure, there are no conflicts between columns corresponding to vertices
$v_i$ and $v_j$ whenever $(v_i, v_j)$ is a directed edge in $\vec{H}$. Therefore, $G_{M,1} \cong G$. □
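The construction in the second half of the proof can be sketched in Python as follows. This is an illustration only, under several assumptions: vertices are numbered 0, ..., n-1 in a topological order of the transitive orientation, rows are 0-indexed (so row 2k of the proof is index 2k-1), and the names `build_matrix`, `g_edges`, and `h_arcs` are hypothetical.

```python
from itertools import combinations

def in_conflict(M, i, j):
    """Columns i and j are in conflict if rows with (1,1), (0,1), (1,0) exist."""
    patterns = {(row[i], row[j]) for row in M}
    return {(1, 1), (0, 1), (1, 0)} <= patterns

def build_matrix(n, g_edges, h_arcs):
    """Build M whose first row has conflict graph G.

    g_edges: edges (i, j) of G with i < j.
    h_arcs:  arcs (i, j), i < j, of a transitive orientation of the
             complement of G (vertex numbering is a topological order).
    """
    rows = 2 * len(g_edges) + 1
    M = [[None] * n for _ in range(rows)]
    M[0] = [1] * n                                   # first row: all ones
    for k, (i, j) in enumerate(g_edges, start=1):
        M[2 * k - 1][i], M[2 * k - 1][j] = 0, 1      # row 2k of the proof
        M[2 * k][i], M[2 * k][j] = 1, 0              # row 2k + 1 of the proof
    preds = {j: [i for (i, jj) in h_arcs if jj == j] for j in range(n)}
    for j in range(n):                               # topological order
        for r in range(rows):
            if M[r][j] is None:
                # Fill with 0 unless a column that must be contained
                # in column j already has a 1 in this row.
                M[r][j] = int(any(M[r][i] == 1 for i in preds[j]))
    return M

# G is the path 0 - 1 - 2; its complement has the single edge {0, 2}.
g_edges = [(0, 1), (1, 2)]
M = build_matrix(3, g_edges, h_arcs=[(0, 2)])
conflicts = [(i, j) for i, j in combinations(range(3), 2) if in_conflict(M, i, j)]
print(conflicts == g_edges)  # True
```

Since the first row is all ones, every column is a vertex of its conflict graph, so comparing the list of conflicting column pairs with the edge list of G is enough to confirm the construction on this example.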

Theorem 2 implies that odd cycles of length at least 5 and their complements are not row-conflict
graphs. The reader not familiar with transitively orientable graphs might find it useful to
verify that the cycle of length 5 cannot be transitively oriented.
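For readers who prefer to check this by machine, the following brute-force Python sketch tries every orientation of a small graph. The function names are illustrative assumptions, and the exhaustive search over all $2^{|E|}$ orientations is only meant for very small graphs.

```python
from itertools import product

def is_transitive(arcs):
    """An orientation is transitive if arcs (u, v) and (v, w) imply (u, w)."""
    arcs = set(arcs)
    return all((u, w) in arcs
               for (u, v) in arcs for (x, w) in arcs if v == x)

def has_transitive_orientation(edges):
    """Try all orientations of the given undirected simple graph."""
    for flips in product([False, True], repeat=len(edges)):
        arcs = [(v, u) if flip else (u, v) for (u, v), flip in zip(edges, flips)]
        if is_transitive(arcs):
            return True
    return False

cycle5 = [(i, (i + 1) % 5) for i in range(5)]
cycle4 = [(i, (i + 1) % 4) for i in range(4)]
print(has_transitive_orientation(cycle5))  # False
print(has_transitive_orientation(cycle4))  # True
```

The reason behind the negative answer for the 5-cycle is that two consecutive arcs u → v → w would require the chord uw, so every vertex must be a source or a sink, and sources and sinks cannot alternate around an odd cycle.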


4 Complexity results 

Theorem 3. The following two problems are NP-complete: 

• The Minimum Conflict-Free Row Split problem. 

• Given a binary matrix M, is it true that $\overline{\gamma}(M) = \sum_r \chi(G_{M,r})$?

Proof. The Minimum Conflict-Free Row Split problem is in NP, since testing if a given binary
matrix M' with at most k rows, equipped with a partition of its rows into m sets, satisfies the
condition in the definition of a row split, as well as the conflict-freeness, can be done in polynomial
time. To argue that the second problem is in NP, we proceed similarly as above, performing an
additional test checking that the number of rows of M' equals $\sum_r \chi(G_{M,r})$. (In this case, we will have
$\overline{\gamma}(M) \le \sum_r \chi(G_{M,r})$ and equality will follow from Corollary 1.) The value of $\sum_r \chi(G_{M,r})$ can be
computed in polynomial time, since each graph $G_{M,r}$ is the complement of a transitively orientable
graph (by Theorem 2), and the chromatic number of complements of transitively orientable graphs
can be computed in polynomial time.

We prove hardness of both problems at once, making a reduction from the following NP-complete
problem [11]: Given a simple cubic graph $G = (V, E)$, is G 3-edge-colorable? (A graph is cubic,
or 3-regular, if every vertex is incident with precisely three edges. A matching in a graph is a
set of pairwise disjoint edges. A graph is 3-edge-colorable if its edge set can be partitioned into 3
matchings.)

Given a simple cubic graph $G = (V, E)$, we construct an instance $(M, k)$ of the Minimum
Conflict-Free Row Split problem as follows:

• M is a $(|V| + 3) \times (|E| + 3)$ binary matrix, with rows indexed by $V \cup \{r_1, r_2, r_3\}$, columns
indexed by $E \cup \{c_1, c_2, c_3\}$, and entries defined as follows (see Fig. 1 for an example):

  - For every row indexed by a vertex $v \in V$ and every column indexed by an edge $e \in E$, we
    have $M_{v,e} = 1$ if v is an endpoint of e, and $M_{v,e} = 0$ otherwise.

  - For every row indexed by a vertex $v \in V$ and every column indexed by some
    $c \in \{c_1, c_2, c_3\}$, we have $M_{v,c} = 1$.

  - For every row indexed by some $r \in \{r_1, r_2, r_3\}$ and every column indexed by an edge
    $e \in E$, we have $M_{r,e} = 0$.

  - For every row indexed by some $r_i \in \{r_1, r_2, r_3\}$ and every column indexed by some
    $c_j \in \{c_1, c_2, c_3\}$, we have $M_{r_i,c_j} = 1$ if $i = j$, and $M_{r_i,c_j} = 0$ otherwise.

• $k = 3|V| + 3$.


Figure 1: An example construction of $(M, k)$ from $G = (V, E)$, with $k = 15$.
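The reduction is simple enough to state as code. The sketch below is illustrative: the function name `reduction` is hypothetical, and the input graph (the complete graph on four vertices, which is cubic) and its edge ordering are choices made here for the example.

```python
def reduction(vertices, edges):
    """Build (M, k) from a cubic graph, following the construction above.

    Rows:    one per vertex, followed by r1, r2, r3.
    Columns: one per edge, followed by c1, c2, c3.
    """
    M = []
    for v in vertices:
        # 1 in the columns of incident edges, and 1 in c1, c2, c3
        M.append([int(v in e) for e in edges] + [1, 1, 1])
    for i in range(3):
        # row r_i: 0 in all edge columns, and a single 1 in column c_i
        M.append([0] * len(edges) + [int(i == j) for j in range(3)])
    return M, 3 * len(vertices) + 3

V = [1, 2, 3, 4]
E = [(1, 2), (2, 3), (1, 3), (1, 4), (2, 4), (3, 4)]
M, k = reduction(V, E)
print(len(M), len(M[0]), k)  # 7 9 15
```

For a graph with four vertices and six edges this gives a 7 × 9 matrix and k = 15, in line with the value of k stated for Figure 1.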

Note that for each row indexed by a vertex $v \in V$, the graph $G_{M,v}$ is isomorphic to the disjoint
union of two complete graphs with three vertices each, hence $\chi(G_{M,v}) = 3$. For each row indexed
by some $r \in \{r_1, r_2, r_3\}$, the graph $G_{M,r}$ consists of a single vertex, thus $\chi(G_{M,r}) = 1$. It follows
that $k = \sum_r \chi(G_{M,r})$ and therefore M is a yes instance of the second problem (“Given a binary
matrix M, is $\overline{\gamma}(M) = \sum_r \chi(G_{M,r})$?”) if and only if $(M, k)$ is a yes instance of the Minimum
Conflict-Free Row Split problem. Hardness of both problems will therefore follow from the
following claim, which we prove next: G is 3-edge-colorable if and only if $\overline{\gamma}(M) \le k$.

Suppose first that G is 3-edge-colorable, and let $E = E_1 \cup E_2 \cup E_3$ be a partition of E into
3 matchings. We obtain a row split M' of M by replacing each row of M indexed by a vertex
$v \in V$ with three rows and keeping each row of M indexed by some $r \in \{r_1, r_2, r_3\}$ unchanged.
Clearly, this will result in a matrix with k rows. For every $v \in V$, we replace the row of M
indexed by v as follows. Vertex v is incident with precisely three edges in G, say $e_1, e_2, e_3$. Since
$E_1, E_2, E_3$ are matchings partitioning E, we may assume, without loss of generality, that $e_i \in E_i$
for all $i \in \{1, 2, 3\}$. The three rows replacing in M' the row of M indexed by v are indexed by
$v^1, v^2, v^3$ and defined as follows: for every $i \in \{1, 2, 3\}$ and every column $c \in E \cup \{c_1, c_2, c_3\}$, we have

$$M'_{v^i,c} = \begin{cases} 1, & \text{if } c = e_i \text{ or } c = c_i; \\ 0, & \text{otherwise.} \end{cases}$$

By construction, M' is a row split of M with k rows. We claim that M' is conflict-free. No pair of
columns indexed by two edges in E agree on value 1 in any row, hence they cannot be in conflict.
The same holds for any pair of columns indexed by two elements of $\{c_1, c_2, c_3\}$. Consider now
two columns, one indexed by an edge $e \in E$ and one indexed by $c_i \in \{c_1, c_2, c_3\}$. Without loss of
generality, we may assume that $e \in E_1$. There are only two rows in which the column indexed by e
has value 1, namely the rows indexed by copies of the endpoints of e, say $u^1$ and $v^1$ (with $u, v \in V$).
The values of M' in column $c_i$ at rows $u^1$ and $v^1$ are both 1 (if $i = 1$), otherwise they are both
0. Consequently, the two columns cannot be in conflict. Since M' is a conflict-free row split of M
with k rows, this establishes $\overline{\gamma}(M) \le k$.

For the converse direction, let M' be a conflict-free row split of M with at most k rows. Let
$V' = V \cup \{r_1, r_2, r_3\}$ and consider a partition $\{R'_i \mid i \in V'\}$ of the set of rows of M' into $|V| + 3$
sets indexed by elements of V' such that for all $i \in V'$, the row of M indexed by i is the bitwise
OR of the rows of $R'_i$. Since k is a lower bound on $\overline{\gamma}(M)$, matrix M' has exactly k rows. This fact
and Corollary 1 imply that each row in M indexed by a vertex $v \in V$ has $|R'_v| = 3$ and each row
indexed by some $r \in \{r_1, r_2, r_3\}$ has $|R'_r| = 1$.

We must have that for all $i \in V'$, the row of M indexed by i is the bitwise sum of the rows of
$R'_i$, that is, for every column $c \in E \cup \{c_1, c_2, c_3\}$, we have $M_{i,c} = \sum_{r \in R'_i} M'_{r,c}$. Indeed, if for some
$i \in V'$ and some column $c \in E \cup \{c_1, c_2, c_3\}$, we have that $\sum_{r \in R'_i} M'_{r,c} > 1$, then i is a vertex of G.
Furthermore, since $|R'_i| = 3$, there are either two edges of G, say e and f, incident with i such that
for some $r \in R'_i$, we have $M'_{r,e} = M'_{r,f} = 1$, or there are two distinct elements $e, f \in \{c_1, c_2, c_3\}$ with
the same property. In the former case, considering the rows replacing the rows of M indexed by
the endpoints of e and f other than i, respectively, we find two distinct rows r' and r'' of M' such
that $M'_{r',e} = M'_{r'',f} = 1$ and $M'_{r'',e} = M'_{r',f} = 0$, which contradicts the fact that M' is conflict-free.
In the latter case, the argument is similar.

By permuting the rows of M' if necessary, we may assume that each set of the form $R'_v$ is
ordered as $R'_v = \{v^1, v^2, v^3\}$ so that

$$M'_{v^i,c_j} = \begin{cases} 1, & \text{if } i = j; \\ 0, & \text{otherwise.} \end{cases}$$

We claim that for every edge $e = uv \in E$ and every $i \in \{1, 2, 3\}$, we have that $M'_{u^i,e} = M'_{v^i,e}$. If
this was not the case, then we would have $M'_{u^i,e} = M'_{v^j,e} = 1$ for a distinct pair $i, j \in \{1, 2, 3\}$. But
then the columns of M' indexed by e and $c_j$ would both agree in value 1 in the row indexed by $v^j$ and
disagree (in opposite directions) in the rows indexed by $u^i$ and $u^j$. Thus, they would be in conflict,
contrary to the fact that M' is conflict-free.

Since for every edge $e = uv \in E$ and every $i \in \{1, 2, 3\}$, we have that $M'_{u^i,e} = M'_{v^i,e}$, we can
partition the edges of E into three pairwise disjoint sets $E_1, E_2, E_3$ by placing every edge $e = uv \in E$
into $E_i$ if and only if $i \in \{1, 2, 3\}$ is the unique index such that $M'_{u^i,e} = M'_{v^i,e} = 1$. We claim that
each $E_i$ is a matching in G. This will imply that G is 3-edge-colorable and complete the proof.
If some $E_i$ is not a matching, then there exist two distinct edges, say $e, f \in E_i$, with a common
endpoint. Let $e = xy$ and $f = xz$. The columns of M' indexed by e and f agree in value 1 at the
row indexed by $x^i$, while they disagree (in opposite directions) in the rows indexed by $y^i$ and $z^i$. Thus,
they are in conflict, contrary to the conflict-freeness of M'. □


Hajirasouliha and Raphael proposed in [10] an algorithm based on graph coloring for optimally
solving the Minimum Conflict-Free Row Split problem by constructing a conflict-free row
split of M with exactly $\sum_r \chi(G_{M,r})$ rows. Since there are infinitely many cubic graphs that are
not 3-edge-colorable, the proof of Theorem 3 implies that there exist infinitely many
matrices M such that $\overline{\gamma}(M) > \sum_r \chi(G_{M,r})$. On such instances, the algorithm from [10] will not
produce a valid (that is, conflict-free) solution.

Since the smallest cubic 4-edge-chromatic graph is the Petersen graph, the smallest matrix
M with $\overline{\gamma}(M) > \sum_r \chi(G_{M,r})$ that can be obtained using the construction given in the proof of
Theorem 3 is of order $13 \times 18$. A smaller matrix M for which the bound from Corollary 1 is not
tight can be obtained by applying a similar construction starting from the complete graph of order
3 (which is a 2-regular 3-edge-chromatic graph):


$$M = \begin{pmatrix}
1 & 1 & 0 & 1 & 1 \\
1 & 0 & 1 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}$$


We leave it as an exercise for the reader to verify that $\sum_r \chi(G_{M,r}) = 8$ and $\overline{\gamma}(M) \ge 9$ (in fact,
$\overline{\gamma}(M) = 9$). Let us also remark that in [16, Section 4.2.1] a binary matrix M is given with
$\overline{\gamma}(M) = \sum_r \chi(G_{M,r})$, on which the algorithm from [10] fails to produce a conflict-free solution.
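The first of these two values can be checked by hand. The following derivation is a sketch of that check, assuming the $5 \times 5$ matrix M has rows 11011, 10111, 01111, 00010, 00001 and numbering its rows and columns 1 to 5 (a labeling introduced here for illustration only):

```latex
% Columns 1, 2, 3 are pairwise in conflict, and columns 4 and 5 are in conflict.
% Each of the columns 1, 2, 3 is contained in both column 4 and column 5,
% so no further conflicts exist.
\begin{align*}
G_{M,r_1} &: \text{vertices } \{1,2,4,5\},\ \text{edges } \{1,2\},\{4,5\}
  && \Rightarrow\ \chi(G_{M,r_1}) = 2, \\
G_{M,r_2} &: \text{vertices } \{1,3,4,5\},\ \text{edges } \{1,3\},\{4,5\}
  && \Rightarrow\ \chi(G_{M,r_2}) = 2, \\
G_{M,r_3} &: \text{vertices } \{2,3,4,5\},\ \text{edges } \{2,3\},\{4,5\}
  && \Rightarrow\ \chi(G_{M,r_3}) = 2, \\
G_{M,r_4} &: \text{the single vertex } 4
  && \Rightarrow\ \chi(G_{M,r_4}) = 1, \\
G_{M,r_5} &: \text{the single vertex } 5
  && \Rightarrow\ \chi(G_{M,r_5}) = 1, \\
\sum_r \chi(G_{M,r}) &= 2 + 2 + 2 + 1 + 1 = 8.
\end{align*}
```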

We conclude this section with another hardness result. 


Theorem 4. The Minimum Distinct Conflict-Free Row Split problem is NP-complete. 

Proof. Membership in NP of the Minimum Distinct Conflict-Free Row Split problem can 
be argued similarly as for the Minimum Conflict-Free Row Split problem. It suffices to argue 
that there is a polynomially-sized conflict-free matrix $M'$ such that $M'$ is a row split of $M$ with at 
most $k$ distinct rows. We may assume that for a partition $R'_1, \dots, R'_m$ of rows of $M'$ into $m$ sets 
satisfying the condition in the definition of a row split, the rows within each $R'_i$ are pairwise distinct. 
Recall (e.g. from [^) that a conflict-free matrix with $d$ distinct rows and $n$ columns corresponds to 
a perfect phylogenetic rooted tree $T$ such that: $T$ has $d$ leaves (the rows of the matrix), all internal 
vertices of $T$ are branching, and all edges from a vertex to its children are injectively labeled with 
a column of $M$, with the exception of at most one edge which is unlabeled. Thus $T$ has at most $2n$ 
edges, and we infer that $d \le 2n$. Therefore, the total number of rows of $M'$ does not exceed $2nm$, 
where $m$ and $n$ are the numbers of rows and columns of $M$, respectively. 

The hardness proof is based on a slight modification of the reduction used in the proof of 
Theorem [^ (See Fig. 2 for an example.) Given a cubic graph $G = (V, E)$, we map it to $(M, k)$ 
where 


• $M$ is the binary matrix obtained from the binary matrix described in the proof of Theorem [^ 
by adding to it three columns $d_1, d_2, d_3$, which on the rows indexed by $V$ equal 0, and 
on the rows indexed by $r_1, r_2, r_3$, each $d_i$ equals $c_i$, $i \in \{1, 2, 3\}$. 

• $k = |E| + 3$. 


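The matrix $M$, with $k = 9$ (the drawing of $G = (V, E)$ is omitted): 

        e1 e2 e3 e4 e5 e6 c1 c2 c3 d1 d2 d3
  v1     1  0  1  1  0  0  1  1  1  0  0  0
  v2     1  1  0  0  1  0  1  1  1  0  0  0
  v3     0  1  1  0  0  1  1  1  1  0  0  0
  v4     0  0  0  1  1  1  1  1  1  0  0  0
  r1     0  0  0  0  0  0  1  0  0  1  0  0
  r2     0  0  0  0  0  0  0  1  0  0  1  0
  r3     0  0  0  0  0  0  0  0  1  0  0  1

Figure 2: An example construction of $(M, k)$ from $G$. 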
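
As an illustration, the instance of Figure 2 can be generated programmatically. The sketch below is a reading of the figure rather than of the original reduction, whose proof is not repeated here: it assumes that each vertex row has a 1 in the columns of its incident edges and in all of $c_1, c_2, c_3$, and that row $r_i$ has a 1 exactly in $c_i$ and $d_i$. The function name and the vertex labels 1 to 4 are hypothetical.

```python
def reduction_instance(vertices, edges):
    """Build (M, k) from a cubic graph.

    Column order as in Figure 2: one column per edge, then c1..c3, then d1..d3.
    Row order: one row per vertex, then r1..r3.
    """
    n_edges = len(edges)
    rows = []
    for v in vertices:
        edge_part = [1 if v in e else 0 for e in edges]
        # Vertex rows: 1 in every c-column, 0 in every d-column.
        rows.append(edge_part + [1, 1, 1] + [0, 0, 0])
    for i in range(3):
        unit = [1 if j == i else 0 for j in range(3)]
        # Row r_i: 1 only in columns c_i and d_i.
        rows.append([0] * n_edges + unit + unit)
    return rows, n_edges + 3

# K4 with the edge order e1..e6 read off the columns of Figure 2.
V = [1, 2, 3, 4]
E = [(1, 2), (2, 3), (1, 3), (1, 4), (2, 4), (3, 4)]
M, k = reduction_instance(V, E)

for row in M:
    print("".join(map(str, row)))
print("k =", k)
```

With this edge order the printed rows coincide with the rows of the matrix in Figure 2.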

We claim that $(M, k)$ is an instance of the Minimum Distinct Conflict-Free Row Split 
problem such that $G$ is 3-edge-colorable if and only if $\eta(M) \le k$. 

Suppose that $G$ is 3-edge-colorable. Given a partition of $E$ into three matchings $E = E_1 \cup E_2 \cup E_3$, 
we construct the same matrix $M'$ as described in the proof of Theorem to which we append the 
three columns indexed by $d_1, d_2, d_3$ which are all 0s on the rows indexed by vertices, and which are 
the same as in $M$ on the rows $r_1, r_2, r_3$. By the same argument given in the proof of Theorem $M'$ 
is conflict-free. Each row $r_i$, $i \in \{1, 2, 3\}$, is distinct from all other rows of $M'$. Let $v^i$, $i \in \{1, 2, 3\}$, 
be a row corresponding to a vertex $v$ and suppose $M'_{v^i, e} = 1$, where $e = uv$ is one of the three edges 
incident to $v$, and $e \in E_i$. By construction, the only other row having a 1 in column $e$ is $u^i$. Thus, 
row $v^i$ is different from all other rows, except $u^i$. In fact, we can see that row $v^i$ is identical to row 
$u^i$, since they have no other entry 1 on the columns indexed by edges. Additionally, they both have 
1 in column $c_i$, since $e \in E_i$, and 0 in the other five columns in $\{c_1, c_2, c_3, d_1, d_2, d_3\} \setminus \{c_i\}$. Hence, 
the number of distinct rows of $M'$ is at most $3|V|/2 + 3 = |E| + 3 = k$, since $G$ is cubic, and thus 
$\eta(M) \le k$. 
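
The forward direction can be traced on the example of Figure 2. The sketch below assumes, following the description in the paragraph above, that the row $v^i$ has a 1 exactly in the column of the edge of $E_i$ incident to $v$ and in column $c_i$; the particular 3-edge-coloring of $K_4$ and all identifiers are chosen for illustration only.

```python
from itertools import combinations

V = [1, 2, 3, 4]
E = [(1, 2), (2, 3), (1, 3), (1, 4), (2, 4), (3, 4)]
# A partition of E(K4) into three perfect matchings (colors 0, 1, 2).
coloring = {(1, 2): 0, (3, 4): 0, (2, 3): 1, (1, 4): 1, (1, 3): 2, (2, 4): 2}

def split_from_coloring(vertices, edges, color):
    """Rows of M': three rows per vertex (one per color), then r1..r3 unchanged."""
    n_edges = len(edges)
    rows = []
    for v in vertices:
        for i in range(3):
            row = [0] * (n_edges + 6)
            for idx, e in enumerate(edges):
                if v in e and color[e] == i:
                    row[idx] = 1          # the edge of color i at v
            row[n_edges + i] = 1          # column c_i
            rows.append(row)
    for i in range(3):
        row = [0] * (n_edges + 6)
        row[n_edges + i] = 1              # column c_i
        row[n_edges + 3 + i] = 1          # column d_i
        rows.append(row)
    return rows

def is_conflict_free(matrix):
    """No two columns intersect without one containing the other."""
    cols = [frozenset(i for i, row in enumerate(matrix) if row[j])
            for j in range(len(matrix[0]))]
    return not any(a & b and a - b and b - a for a, b in combinations(cols, 2))

M_split = split_from_coloring(V, E, coloring)
distinct_rows = {tuple(row) for row in M_split}
print(len(M_split), len(distinct_rows), is_conflict_free(M_split))
```

On this example the split has 15 rows, of which $|E| + 3 = 9$ are pairwise distinct, in agreement with the count in the proof.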

For the converse direction, suppose that $M'$ is a conflict-free row split of $M$ with at most 
$k = |E| + 3$ distinct rows. Let $V' = V \cup \{r_1, r_2, r_3\}$ and consider a partition $\{R'_i \mid i \in V'\}$ of the 
set of rows of $M'$ into $|V| + 3$ sets indexed by elements of $V'$ such that for all $i \in V'$, the row of 
$M$ indexed by $i$ is the bitwise OR of the rows of $R'_i$. We will prove that (1) the number of pairwise 
distinct rows in $R'_v$ is 3 for all $v \in V$, and that (2) the number of pairwise distinct rows in $R'_r$ is 1 
for all $r \in \{r_1, r_2, r_3\}$. Applying the same approach as in the proof of Theorem will then imply 
that $G$ is 3-edge-colorable. 

As argued in the proof of Theorem no row of $M'$ has two 1s on two columns indexed by two 
edges, say $e$ and $f$, because each of $e$ and $f$ has an endpoint which is not an endpoint of the other 
edge (and thus a row with two 1s on two columns indexed by two edges would imply a conflict in 
$M'$). Moreover, no row of $M'$ has two 1s on two columns indexed by $c_1, c_2, c_3$. 


Let us associate with each row of $M'$ belonging to some $R'_i$ with $i \in V$ the edge column where 
it has a 1 (if there is any). Since each edge column contains a 1 and no row has two 1s on the 
columns indexed by edges, the number of pairwise distinct rows of $M'$ indexed by a vertex is at 
least $|E|$. Since in each $R'_{r_i}$, $i \in \{1, 2, 3\}$, we must have at least one row distinct from all other rows 
of $M'$ (because of the 1s in columns $d_1, d_2, d_3$), and $M'$ has at most $|E| + 3$ pairwise distinct rows, 
the number of distinct rows of $M'$ is exactly $|E| + 3$. This directly implies (2), more precisely, that 
each $R'_r$ consists only of a row identical to the corresponding row of $M$. 

In order to prove property (1), suppose now that there is a row of $M$ indexed by a vertex $v$ such 
that $R'_v$ contains at least 4 pairwise distinct rows. Observe first that there is no row in $R'_v$ having 
a 1 only in one column among $\{c_1, c_2, c_3\}$ (and only 0s in the columns indexed by edges). Indeed, 
besides being distinct from the row in each $R'_r$, $r \in \{r_1, r_2, r_3\}$, it would also be distinct from each 
of the set of at least $|E|$ rows of $M'$ having a 1 on a column indexed by an edge. Thus this would 
contradict the fact that $M'$ has at most $|E| + 3$ pairwise distinct rows. This implies that there are 
two distinct rows $v'$ and $v''$ in $R'_v$ such that $v'$ and $v''$ both contain a 1 on the same column indexed 
by an edge, say $e$, but on a column among $\{c_1, c_2, c_3\}$, say $c_j$, $v'$ contains 1 and $v''$ contains 0. This 
shows that there is a conflict in $M'$, since $M'_{r_j, e} = 0$ and $M'_{r_j, c_j} = 1$, a contradiction. □ 

5 A polynomially solvable case 

In this section we consider the binary matrices in which no column is contained in both columns of 
a pair of conflicting columns, and derive a polynomial time algorithm for the Minimum Conflict-Free 
Row Split problem on such matrices. The main idea behind the algorithm is the fact that 
on such matrices the lower bound from Corollary 1 is achieved, and the bound can be expressed 
in terms of parameters of a set of derived digraphs, the so-called directed containment graphs (see 
Definition 5 below). 

Let M be a binary matrix such that no column of M is contained in two or more conflicting 
columns. If there are duplicated columns in M, then we form a new matrix where we take just one 
copy of the columns that are duplicated. Since an optimal solution of the reduced instance can be 
mapped to an optimal solution of the original instance (by duplicating the columns corresponding 
to the copies of the duplicated columns in M kept by the reduction), we may assume that there 
are no duplicated columns in M. 

Definition 5. Given a binary matrix $M$ with distinct columns $c_1, \dots, c_n$ and a row $r$ of $M$, the 
directed containment graph $\vec{H}_{M,r}$ of $(M, r)$ is the graph whose vertex set is the set of columns of 
$M$ having a 1 in row $r$, in which there is a directed edge from $c_i$ to $c_j$ if and only if $i \ne j$ and $c_i$ is 
contained in $c_j$. 

We will use the notation $c_i \sqsubset_r c_j$ as a shorthand for the fact that $(c_i, c_j)$ is a directed edge of 
$\vec{H}_{M,r}$. We say that $c_i$ is a source of $\vec{H}_{M,r}$ if $c_i \in V(\vec{H}_{M,r})$ and there is no $c_j$ with $c_j \sqsubset_r c_i$. Let $\sigma(M, r)$ 
denote the number of sources in $\vec{H}_{M,r}$. 

Lemma 2. If there are no duplicated columns in $M$, then $\sigma(M, r) \le \chi(G_{M,r})$ holds for any row $r$ 
of $M$. 

Proof. Two vertices in the complement of $G_{M,r}$ are adjacent if and only if the corresponding columns 
of $M$ are either disjoint or one contains the other one. However, since the vertices of both $\vec{H}_{M,r}$ 
and $G_{M,r}$ correspond to columns in which $M$ has value 1 in row $r$, no two such columns can be 
disjoint. Consequently, the underlying undirected graph of $\vec{H}_{M,r}$ is equal to the complement of 
$G_{M,r}$. The set of all sources of $\vec{H}_{M,r}$ forms an independent set in its underlying undirected graph. 
This set corresponds to a clique in $G_{M,r}$. Therefore $\sigma(M, r) \le \omega(G_{M,r}) \le \chi(G_{M,r})$ (where $\omega(G_{M,r})$ 
denotes the maximum size of a clique in $G_{M,r}$). □ 

Our algorithm is the following one (see also Fig. 3 for an example). 


Input: An $m \times n$ binary matrix $M$ with columns $c_1, c_2, \dots, c_n$, without duplicated columns, 
and such that no column of $M$ is contained in both columns of a pair of conflicting columns. 
Output: A conflict-free row split $M'$ of $M$ with $\gamma(M)$ rows. 

Algorithm: 

1. Define a new matrix $M'$ with columns $c'_1, c'_2, \dots, c'_n$. 

2. For each row $r$ of $M$, add the rows $r'_1, \dots, r'_{\sigma(M,r)}$ to $M'$, defined as: 

Let $c_{r,1}, \dots, c_{r,\sigma(M,r)}$ be the sources of $\vec{H}_{M,r}$. 

$M'_{r'_k, c'_j} = 1$ if $c_{r,k} = c_j$ or $c_{r,k} \sqsubset_r c_j$, and $M'_{r'_k, c'_j} = 0$ otherwise. 

3. Return $M'$. 


Left, the matrix $M$ (the second row is the row $r$): 

        c1  c2  c3  c4  c5
         1   0   0   0   0
  r      1   1   1   1   0
         0   1   0   1   0
         0   1   1   0   1
         0   1   0   0   0

Center, the rows of $M'$ replacing $r$: 

        c'1 c'2 c'3 c'4 c'5
  r'1    1   0   0   0   0
  r'2    0   1   1   0   0
  r'3    0   1   0   1   0

Right, $\vec{H}_{M,r}$ (drawing omitted), with sources $c_{r,1} = c_1$, $c_{r,2} = c_3$, $c_{r,3} = c_4$. 

Figure 3: An example of a matrix $M$ in which no column is contained in both columns of a 
pair of conflicting columns ($c_1, c_2$ and $c_3, c_4$ are conflicting). The rows $r'_1, r'_2, r'_3$ constructed by 
the algorithm corresponding to row $r$ of $M$ are shown in the center. On the right, the directed 
containment graph of $(M, r)$. 
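
The algorithm is short enough to be sketched directly. The code below is an illustration written for this example, not the C++ implementation accompanying the paper; it assumes that the input has no duplicated columns, represents each column by the set of rows in which it has a 1, and uses hypothetical function names.

```python
from itertools import combinations

def column_sets(matrix):
    """Represent each column by the set of rows in which it has a 1."""
    return [frozenset(i for i, row in enumerate(matrix) if row[j])
            for j in range(len(matrix[0]))]

def split_by_sources(matrix):
    """For each row r, emit one new row per source of the directed containment
    graph of (M, r); the new row has a 1 in every column containing the source."""
    cols = column_sets(matrix)
    result = []
    for row in matrix:
        support = [j for j, x in enumerate(row) if x]
        # A source properly contains no other column of the support.
        sources = [j for j in support
                   if not any(cols[i] < cols[j] for i in support)]
        for s in sources:
            result.append([1 if cols[s] <= cols[j] else 0
                           for j in range(len(row))])
    return result

def is_conflict_free(matrix):
    cols = column_sets(matrix)
    return not any(a & b and a - b and b - a for a, b in combinations(cols, 2))

# The matrix M of Figure 3; row index 1 is the row r.
M = [
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 0],
]
M_split = split_by_sources(M)
for new_row in M_split:
    print(new_row)
```

On the matrix of Figure 3 the sketch returns 7 rows; the three rows produced for $r$ are exactly $r'_1, r'_2, r'_3$ from the figure, and every other row of $M$ is left unsplit.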


Theorem 5. For any $m \times n$ binary matrix $M$ without duplicated columns such that no column of $M$ 
is contained in both columns of a pair of conflicting columns, it holds that $\gamma(M) = \sum_r \chi(G_{M,r}) = 
\sum_r \sigma(M, r)$. Moreover, a conflict-free row split $M'$ of $M$ with $\gamma(M)$ rows can be constructed in 
time $O(mn^2)$. 

Proof. We claim that the matrix $M'$ produced by the above algorithm is a conflict-free row split of 
$M$ with number of rows equal to $\sum_r \sigma(M, r)$. 

It is clear that $M'$ is a row split of $M$. Let us prove that $M'$ is conflict-free. Suppose the 
contrary, that is, let $c'_i$ and $c'_j$ be two columns of $M'$ which are in conflict. Then, there exists a row 
$r'_k$ of $M'$ (obtained by splitting a row $r$ of $M$) which has 1 in columns $c'_i$ and $c'_j$. 

We will first show that $c_i$ is contained in $c_j$ or vice versa. If $c_{r,k} = c_i$ (resp. $c_{r,k} = c_j$) then 
$c_i \sqsubset_r c_j$ (resp. $c_j \sqsubset_r c_i$) and therefore column $c_i$ is contained in column $c_j$ (resp. $c_j$ is contained in 
$c_i$). Suppose now that $c_{r,k} \notin \{c_i, c_j\}$. Since $r'_k$ has 1 in columns $c'_i$ and $c'_j$ it follows that $c_{r,k} \sqsubset_r c_i$ 
and $c_{r,k} \sqsubset_r c_j$. This implies that column $c_{r,k}$ is contained in both column $c_i$ and column $c_j$. By the 
assumption on $M$, $c_i$ and $c_j$ cannot be in conflict, hence, one of them is contained in the other one. 

Thus, we may assume without loss of generality that $c_i$ is contained in $c_j$. Since $c'_i$ and $c'_j$ are 
in conflict it follows that there exists a row $w'_\ell$ of $M'$ which has 1 in column $c'_i$ and 0 in column 
$c'_j$. This implies that the corresponding row $w$ of $M$ has 1 in column $c_i$, and consequently also in 
$c_j$, since $c_i$ is contained in $c_j$. Therefore, both $c_i$ and $c_j$ are vertices of $\vec{H}_{M,w}$. If $c_i = c_{w,\ell}$, then 
$w'_\ell$ has value 1 in column $c'_j$ (since $c_i$ is contained in $c_j$), which contradicts the choice of $w'_\ell$. Thus, 
$c_i \ne c_{w,\ell}$ and $c_{w,\ell} \sqsubset_w c_i$. However, since $c_i$ is contained in $c_j$ and $\vec{H}_{M,w}$ is transitive, it follows that 
$c_{w,\ell} \sqsubset_w c_j$. This implies that row $w'_\ell$ has value 1 in column $c'_j$, which again contradicts the choice 
of $w'_\ell$. This finally shows that $M'$ is conflict-free. 

Since the number of rows in $M'$ is $\sum_r \sigma(M, r)$ and $M'$ is conflict-free, we have $\gamma(M) \le 
\sum_r \sigma(M, r)$. By Corollary 1 and Lemma 2 we have $\sum_r \sigma(M, r) \le \sum_r \chi(G_{M,r}) \le \gamma(M)$. This 
implies equality. 

It remains to justify the time complexity. First, we compute, in time $O(mn^2)$, the transitive 
orientation $\vec{H}_M$ of the undirected containment graph $H_M$ as specified by Observation (that 
is, $(c_i, c_j)$ is an arc of $\vec{H}_M$ if and only if $c_i$ is properly contained in $c_j$). Since for each row 
$r$ of $M$, the graph $\vec{H}_{M,r}$ is an induced subdigraph of $\vec{H}_M$, the $\sigma(M, r)$ sources of $\vec{H}_{M,r}$ can 
be computed from $\vec{H}_M$ in the straightforward way in time $O(n^2)$. The corresponding $\sigma(M, r)$ 
new rows of $M'$ can be computed in time $O(\sigma(M, r) n)$, which results in total time complexity of 
$O(mn^2) + O(\sum_r \sigma(M, r) n) = O(mn^2)$, as claimed. □ 

Note that the correctness of the algorithm crucially relies on the assumption that no column of 
the input matrix is contained in both columns of a pair of conflicting columns. For example, the 
algorithm fails to resolve the conflict in the $3 \times 3$ input matrix 

$$
M = \begin{pmatrix}
1 & 1 & 1 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix},
$$

which violates the assumption (column $c_1$ is contained in both columns $c_2$ and $c_3$, which are in 
conflict). Given the above matrix $M$, the output matrix $M'$ computed by the algorithm is in fact 
equal to $M$. 

It is also worth mentioning that if the input matrix satisfies the stronger property that no column 
is contained in another one, Theorem 5 implies that the naive solution obtained by splitting each 
row $r$ into as many rows as the number of 1s it contains always produces an optimal solution. This 
is true since all vertices of $\vec{H}_{M,r}$ are sources. We thus obtain: 

Corollary 2. For any binary $m \times n$ matrix $M$ such that no column of $M$ is contained in another 
one, it holds that $\gamma(M) = m'$, where $m'$ equals the number of 1s in $M$. Moreover, a conflict-free 
row split $M'$ of $M$ of size $m' \times n$ can be constructed in time $O(m'n)$. 

6 A heuristic algorithm based on coloring co-comparability graphs 

As pointed out in Section 4, the graph theoretic algorithm from [10, Section 4] fails to always 
produce a conflict-free row split of the input matrix. In this section, we propose a polynomial time 
heuristic algorithm for the Minimum Conflict-Free Row Split problem, that is, an algorithm 
that always produces a conflict-free row split of the input matrix. This algorithm is also based on 
graph colorings. 

Before presenting the algorithm, we describe the intuition behind it. The lower bound on $\gamma(M)$ 
given by Corollary 1 follows from the fact that in every conflict-free row split $M'$ of the input 
matrix $M$, the rows replacing row $r$ in the split can be used to produce a valid vertex coloring of 
$G_{M,r}$, the conflict graph of $(M, r)$. The difficulty in reversing this argument in order to obtain a 
row split of $M$ having a number of rows close to the lower bound $\sum_r \chi(G_{M,r})$ is due to the fact 
that we cannot independently combine the splits of rows $r$ of $M$ according to optimal colorings of 
their conflict graphs, as new conflicts may arise. 

We can guarantee that the corresponding row splits will be pairwise compatible (in the sense 
that no new conflicts will be generated) as follows. We color $G_M$, the conflict graph of the whole 
input matrix (which we will define in a moment), and split each row $r$ according to the coloring 
of its conflict graph $G_{M,r}$ given by the restriction of the coloring of $G_M$ to the vertex set of $G_{M,r}$. 
The graph $G_M$, the conflict graph of $M$, is defined as follows: with each column of $M$, we associate 
a vertex in $G_M$. Two vertices in $G_M$ are connected by an edge if and only if the corresponding 
columns in $M$ are in conflict. Note that each conflict graph $G_{M,r}$ of an individual row is an induced 
subgraph of $G_M$, hence the restriction of any proper coloring $c$ of $G_M$ to $V(G_{M,r})$ is a proper 
coloring of $G_{M,r}$. 

The above approach will result in a row split having number of rows given by the value of 
$\sum_r |c(V(G_{M,r}))|$, where $|c(V(G_{M,r}))|$ denotes the number of colors used by $c$ on $V(G_{M,r})$. As a 
first heuristic attempt to minimize this quantity, Kacar proposed in [16, Section 4.2.2] to choose 
a coloring $c$ of $G_M$ with $\chi(G_M)$ colors. However, this is computationally intractable. While 
row-conflict graphs are characterized (in Theorem ) as exactly the co-comparability graphs (that is, 
as complements of transitively orientable graphs), for which the coloring problem is polynomially 
solvable, conflict graphs of binary matrices do not enjoy such nice features. Indeed, if $G$ is any 
graph of minimum degree at least 2, then $G = G_M$, where $M \in \{0, 1\}^{E \times V}$ is the edge-vertex 
incidence matrix of $G$ (defined by $M_{e,v} = 1$ if and only if vertex $v$ is an endpoint of edge $e$). 
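
The last claim is easy to confirm on a small instance. The sketch below builds the edge-vertex incidence matrix of the cycle on five vertices, a graph of minimum degree 2 chosen here only as an example, and checks that the conflict graph of that matrix is the cycle itself; the function name is illustrative.

```python
from itertools import combinations

def conflict_graph(matrix):
    """Edge set of G_M: pairs of columns that intersect with neither containing the other."""
    cols = [frozenset(i for i, row in enumerate(matrix) if row[j])
            for j in range(len(matrix[0]))]
    return {(u, v) for u, v in combinations(range(len(cols)), 2)
            if cols[u] & cols[v] and cols[u] - cols[v] and cols[v] - cols[u]}

# The 5-cycle on vertices 0..4, each edge written with its smaller endpoint first.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)]

# Edge-vertex incidence matrix: one row per edge, one column per vertex.
incidence = [[1 if v in e else 0 for v in range(5)] for e in cycle_edges]

print(conflict_graph(incidence) == set(cycle_edges))
```

Two vertex columns share a row exactly when the vertices are joined by an edge, and the minimum degree condition supplies, for each endpoint, a second edge that witnesses the conflict.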

This can be amended as follows. We can “restore” the structure of co-comparability graphs by 
observing that $G_M$ is a spanning subgraph of $\overline{H}_M$, the complement of the undirected containment 
graph $H_M$ (cf. Definition [^), and working with $\overline{H}_M$ instead. Recall that $H_M$ is the undirected 
graph whose vertices correspond to the columns of $M$ and in which two vertices $i$ and $j$, $i \ne j$, 
are adjacent if and only if one of the corresponding columns is contained in the other one. To show 
that $G_M$ is a spanning subgraph of $\overline{H}_M$, note first that we may assume that $V(G_M) = V(\overline{H}_M)$ (as 
both vertex sets are in bijective correspondence with the set of columns of $M$). Moreover, if two 
vertices $i$ and $j$ of $G_M$ are adjacent, then the corresponding columns are in conflict, which implies 
that neither of them is contained in the other one; consequently, they are adjacent in $\overline{H}_M$. 

Since $G_M$ is a spanning subgraph of $\overline{H}_M$, any proper coloring of $\overline{H}_M$ is also a proper coloring 
of $G_M$. Moreover, even though the graph $\overline{H}_M$ might have more edges than $G_M$, these additional 
edges (if any) will not be contained in any of the graphs $G_{M,r}$. Indeed, for every row $r$, its conflict 
graph $G_{M,r}$ coincides both with the subgraph of $G_M$ induced by $U := V(G_{M,r})$ as well as with the 
subgraph of $\overline{H}_M$ induced by $U$. This is because for any two vertices $i$ and $j$ in $U$ that are adjacent 
in $\overline{H}_M$, the corresponding columns cannot be disjoint, therefore, since $i$ and $j$ are not adjacent in 
$H_M$, the corresponding columns must be in conflict. 

In view of the above observations, we propose choosing an optimal coloring $c$ of the 
co-comparability graph $\overline{H}_M$ as a heuristic approach to minimizing the value of $\sum_r |c(V(G_{M,r}))|$ for a 
coloring $c$ of $G_M$. A row split of $M$ is then defined according to the coloring $c$. 
This leads to the following algorithm (see Fig. 4 for an example): 


Input: An $m \times n$ binary matrix $M$ with columns labeled with $1, \dots, n$. 
Output: A conflict-free row split $M'$ of $M$. 

Algorithm: 

1. Define a new matrix $M'$ with $n$ columns labeled with $1, \dots, n$. 

2. Compute $\overline{H}_M$, the complement of the undirected containment graph $H_M$. 

3. Compute an optimal coloring $c$ of $\overline{H}_M$. 

4. For each row $r$ of $M$: 

Let $c(V(G_{M,r})) = \{s^r_1, \dots, s^r_{t_r}\}$. 

Add the rows $r'_1, \dots, r'_{t_r}$ to $M'$, defined as: 

$M'_{r'_i, j} = 1$ if $M_{r,j} = 1$ and $c(j) = s^r_i$, and $M'_{r'_i, j} = 0$ otherwise. 

5. Return $M'$. 


The matrix $M$ (the second row is the row $r$): 

         1   2   3   4   5
         1   0   0   0   0
  r      1   1   1   1   0
         0   1   0   1   0
         0   1   1   0   1
         0   1   0   0   0

The rows of $M'$ replacing $r$, where $s_1, s_2, s_3$ are the colors as used by $c$ (the drawing of 
$\overline{H}_M$ is omitted): 

         1   2   3   4   5
  r'1    1   0   0   0   0
  r'2    0   0   0   1   0
  r'3    0   1   1   0   0

Figure 4: An example of a binary matrix $M$, the complement of its undirected containment graph 
together with an optimal coloring $c$, and a split of a row of $M$ according to the above algorithm. 
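
A compact sketch of the heuristic is given below. It is an illustration under stated assumptions rather than the accompanying C++ implementation: containment ties between duplicated columns are broken by column index, the minimum chain partition is obtained from a maximum bipartite matching computed with simple augmenting paths (not the Hopcroft and Karp algorithm), and all identifiers are hypothetical.

```python
from itertools import combinations

def column_sets(matrix):
    return [frozenset(i for i, row in enumerate(matrix) if row[j])
            for j in range(len(matrix[0]))]

def chain_coloring(cols):
    """Color columns so that each color class is a chain under containment,
    using as few colors as possible (minimum chain partition via matching)."""
    n = len(cols)
    def below(i, j):  # strict order; duplicated columns are ordered by index
        return i != j and (cols[i] < cols[j] or (cols[i] == cols[j] and i < j))
    succ = [[j for j in range(n) if below(i, j)] for i in range(n)]
    pred = [-1] * n  # pred[j] = i: i immediately precedes j in its chain

    def augment(i, seen):
        for j in succ[i]:
            if j not in seen:
                seen.add(j)
                if pred[j] == -1 or augment(pred[j], seen):
                    pred[j] = i
                    return True
        return False

    for i in range(n):
        augment(i, set())
    nxt = {pred[j]: j for j in range(n) if pred[j] != -1}
    color, used = [-1] * n, 0
    for j in range(n):
        if pred[j] == -1:  # j starts a chain; walk along it
            v = j
            while v is not None:
                color[v] = used
                v = nxt.get(v)
            used += 1
    return color

def heuristic_split(matrix):
    color = chain_coloring(column_sets(matrix))
    result = []
    for row in matrix:
        for s in sorted({color[j] for j, x in enumerate(row) if x}):
            result.append([1 if x and color[j] == s else 0
                           for j, x in enumerate(row)])
    return result

def is_conflict_free(matrix):
    cols = column_sets(matrix)
    return not any(a & b and a - b and b - a for a, b in combinations(cols, 2))

M = [
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 0],
]
M_split = heuristic_split(M)
for new_row in M_split:
    print(new_row)
```

On the matrix of Figure 4 this sketch splits row $r$ into the three rows shown in the figure and returns 8 rows in total, one more than the 7 rows that Theorem 5 guarantees for this matrix, which illustrates that the heuristic need not be optimal.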


Theorem 6. For any $m \times n$ binary matrix $M$, the above algorithm can be implemented to run in 
time $O(n^2(n^{1/2} + m))$. The matrix $M'$ output by the algorithm is a conflict-free row split of $M$. 


Proof. We first show that the matrix $M'$ produced by the above algorithm is a conflict-free row 
split of $M$. Clearly, $M'$ is a row split of $M$. We say that a row $r'$ of $M'$ is an $r$-row if $r$ is 
the row of $M$ such that $r'$ was added to $M'$ in step 4 of the algorithm when considering row 
$r$. Suppose for a contradiction that $M'$ is not conflict-free, and let $(j, j')$ be a pair of conflicting 
columns of $M'$. Then, there exist rows $p$, $q$, and $r$ of $M$ and rows $p'_i$, $q'_k$, and $r'_\ell$ of $M'$ such that 
$p'_i$ is a $p$-row, $q'_k$ is a $q$-row, and $r'_\ell$ is an $r$-row, $M'_{p'_i, j} = M'_{p'_i, j'} = 1$, $M'_{q'_k, j} = 1$, $M'_{q'_k, j'} = 0$, 
$M'_{r'_\ell, j} = 0$, and $M'_{r'_\ell, j'} = 1$. Since $M'_{p'_i, j} = M'_{p'_i, j'} = M'_{q'_k, j} = M'_{r'_\ell, j'} = 1$, the definition of $M'$ implies that 
$M_{p,j} = M_{p,j'} = M_{q,j} = M_{r,j'} = 1$ and $c(j) = c(j') = s^p_i$, $c(j) = s^q_k$, and $c(j') = s^r_\ell$. Consequently, 
$s^p_i = s^q_k = s^r_\ell$. Moreover, since $M'_{q'_k, j'} = 0$ and $c(j') = s^q_k$, we infer that $M_{q,j'} = 0$ and, similarly, 
since $M'_{r'_\ell, j} = 0$ and $c(j) = s^r_\ell$, we infer that $M_{r,j} = 0$. It follows that columns $j$ and $j'$ of $M$ are 
in conflict. On the other hand, since $c(j) = c(j')$ and $c$ is a proper vertex coloring of $\overline{H}_M$, vertices 
corresponding to $j$ and $j'$ are non-adjacent in $\overline{H}_M$. Hence, they are adjacent in $H_M$, which means 
that one of the columns $j$ and $j'$ is contained in the other one, contradicting the fact that they are 
in conflict. This completes the proof that $M'$ is conflict-free. 

It remains to justify the time complexity. The graph $\overline{H}_M$ can be computed in time $O(mn^2)$ by 
comparing every ordered pair of columns for containment. In the same time, we can also compute 
a transitive orientation $\vec{H}_M$ of $H_M$ (as done for example in Observation [^). An optimal coloring 
of $\overline{H}_M$ corresponds to a minimum chain partition of the partially ordered set $P$ on $V(H_M)$ defined 
by this orientation, and can be computed in time $O(n^{5/2})$ as follows (cf. [24] and [19, p. 73-74]). 
Applying the approach of Fulkerson, a minimum chain partition of $P$ can be computed by solving a maximum 
matching problem in a derived bipartite graph having $2n$ vertices. This can be done in time $O(n^{5/2})$ 
using the algorithm of Hopcroft and Karp [12]. Step 4 of the algorithm can be executed in time 
$O(\sum_r |c(V(G_{M,r}))| n) = O(mn^2)$. The claimed time complexity of $O(n^2(n^{1/2} + m))$ follows. □ 


7 Implementation and experimental results 

C++ implementations of both algorithms are available at 
https://github.com/alexandrutomescu/MixedPerfectPhylogeny. The input matrices must be in .csv format, 
and are allowed to have duplicate columns. In addition to a conflict-free row split matrix, we also 
output its perfect phylogeny 
tree. We tested our implementation on the ten binary matrices constructed from clear cell renal 
cell carcinomas (ccRCC) from [^: EV001-EV003, EV005-EV007, RMH002, RMH004, RMH008, 
RK26. The phylogenetic trees constructed by from these matrices appear in E Fig. 3]. 


Table 1: The numbers of rows, columns, and pairwise distinct columns in the input matrices, the 
lower bounds on the minimum number of rows in conflict-free row splits of the input matrices given 
by Corollary 1, and the numbers of rows and pairwise distinct rows in the matrices output by the 
heuristic algorithm. The distinct rows form the leaves of the output perfect phylogeny. 

            Input                               Lower bound on     Output
Name        #rows   #cols   #distinct cols      #rows in output    #rows   #distinct rows
EV001       10      122     22                  37                 51      22
EV002       7       103     17                  18                 29      17
EV003       8       56      12                  13                 19      12
EV005*      7       83      10                  7                  7       7
EV006       9       76      11                  13                 25      11
EV007       8       66      15                  19                 25      15
RMH002      5       54      11                  9                  13      11
RMH004      6       140     17                  16                 21      17
RMH008      8       81      10                  10                 20      10
RK26        11      75      17                  18                 26      17

* Input is conflict-free 

These ten matrices have between 5 and 11 rows, and between 55 and 140 columns. One of 
them is already conflict-free, while the other nine do not belong to the polynomially-solvable case 
discussed in Section 5. On each of these nine matrices, our heuristic algorithm from Section 6 
runs in less than one second. On ten random binary matrices with 50 rows and 1000 columns, 
the heuristic algorithm runs on average in 97 seconds. Due to the restricted structure of the input 
matrices on which the polynomial time algorithm from Section 5 works correctly, we did not test 
the implementation of this algorithm on random inputs. However, since this algorithm is simpler 
than the heuristic one, it is plausible to expect that it will run at least as fast. Of course one might 
want to first check whether the input $m \times n$ binary matrix satisfies the assumption that no column 
is contained in both columns of a pair of conflicting columns. This can be tested straightforwardly 
in time $O(n^2(m + n))$ by first classifying each ordered pair of columns as conflicting, disjoint, or in 
containment, and then testing each triple of columns for the condition that the first one is contained 
in each of the other two, which are in conflict. The running time of this check is asymptotically 
worse than that of the heuristic algorithm. However, we expect the running times to differ little on 
practical instances of moderate sizes, since the constant hidden in the $O(\cdot)$ notation for the above 
check is small. 
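
The check just described can be sketched as follows. This is an illustrative version, not the code from the repository; it treats containment as plain set inclusion of the sets of rows in which the columns have a 1, and the function name is hypothetical.

```python
from itertools import combinations

def satisfies_assumption(matrix):
    """True if no column is contained in both columns of a conflicting pair."""
    cols = [frozenset(i for i, row in enumerate(matrix) if row[j])
            for j in range(len(matrix[0]))]
    n = len(cols)
    # Pairwise classification: O(n^2 m) set operations.
    contained = [[cols[i] <= cols[j] for j in range(n)] for i in range(n)]
    conflict = [[bool(cols[i] & cols[j])
                 and not contained[i][j] and not contained[j][i]
                 for j in range(n)] for i in range(n)]
    # Triple test: O(n^3) table lookups.
    for a in range(n):
        for i, j in combinations(range(n), 2):
            if conflict[i][j] and contained[a][i] and contained[a][j]:
                return False
    return True

# The matrix of Figure 3 satisfies the assumption.
M_fig3 = [
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 0],
]
# The 3 x 3 matrix from Section 5 violates it: c1 lies in both c2 and c3.
M_bad = [
    [1, 1, 1],
    [0, 1, 0],
    [0, 0, 1],
]
print(satisfies_assumption(M_fig3), satisfies_assumption(M_bad))  # True False
```

Precomputing the two pairwise tables keeps the triple test down to constant-time lookups, which is where the $O(n^2(m + n))$ bound comes from.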

In Table 1 we list the numbers of rows (i.e., samples) in the ten original matrices, together with 
numbers of columns and pairwise distinct columns, the lower bounds on the minimum number of 
rows in a conflict-free row split of each of the matrices given by Corollary 1, and the numbers of 
rows and pairwise distinct rows in the matrices output by the heuristic algorithm. The similarity 
between the numbers of pairwise distinct columns in the input and of pairwise distinct rows in the 
output can be explained by the observation that if the input matrix consists of $n$ pairwise distinct 
columns, then there will be at most $n$ distinct rows in the naive solution (split each row into as 
many rows of the identity matrix as the number of 1s it contains). Thus $n$ is an upper bound 
for an optimal solution to the Minimum Distinct Conflict-Free Row Split problem, which 
explains why our heuristic algorithm applied to the Minimum Distinct Conflict-Free Row 
Split problem performs similarly to the naive one. In Fig. 5 we show four perfect phylogeny trees 
corresponding to the matrices output by the heuristic algorithm. The results on all matrices are 
available online, linked from https://github.com/alexandrutomescu/MixedPerfectPhylogeny. 

We also ran our heuristic algorithm on the same datasets, with the difference that we removed 
from the input matrices those columns that appear (as binary vectors) strictly less than 2 times 
(this value is a parameter to our implementation). This is the same idea and default threshold used 
by Popic et al. [26] in their tool LICHeE. Their motivation is that several mutations accumulate 
before a new tumor branch separates, and thus such rare mutation patterns may be due to errors 
in the data. We show four output trees in Fig. 6. 

Finally, we also ran LICHeE on the same ten patients from [^. Since LICHeE uses the variant 
allele frequency (VAF) of every mutation, we used the matrices containing VAF values linked from 
https://github.com/viq854/lichee/tree/master/LICHeE/data/ccRCC and ran LICHeE with 
the parameters indicated therein. As referred to in the above, LICHeE starts by grouping the 
robust mutations into clusters of size at least a given number, by default 2. (This is the same 
experiment as the one done in the paper [26] introducing LICHeE, where further details can be 
found.) We show four trees produced by LICHeE in Fig. 7. Note that LICHeE does not necessarily 
output binary trees. 

Figure 5: Four perfect phylogeny trees corresponding to the conflict-free row split matrices output 
by our heuristic algorithm from Section 6. The naming convention is: R1 is an original row (i.e., 
sample) name, and R1_1, R1_2, R1_3 are the row (i.e., sample) names corresponding to R1 in 
the output matrix. Equal split rows form a single node of the perfect phylogeny (equalities are 
indicated in boxes). 

(a) EV003 

Figure 6: The perfect phylogeny trees output by our heuristic algorithm, after removing from 
the input matrices those columns that appear only once. The naming convention is as in Fig. 5. 
Compare these trees to the trees from [^, Fig. 3] and to the ones from Fig. 

Figure 7: The tumor evolutions predicted by LICHeE [26]. The numbers indicate the 
number of mutations at that node. See [26] for further details. 

8 Discussion 

After filtering the unique columns of the matrices EV003 and RMH002, the resulting matrices 
are conflict-free. The perfect phylogeny for RMH002 (Fig. 6) is the same as the one produced by 
LICHeE (Fig. 7) and the one from [^, Fig. 3]. The perfect phylogeny for EV003 (Fig. 6) is almost 
identical to the one produced by LICHeE (Fig. 7) and the one from [^, Fig. 3], the only difference 
being that each of the pairs of samples {R2, R5} and {R1, R3} is collapsed into a single leaf of 
the phylogeny. This may be due to the fact that LICHeE and the method used by [^ exploit VAF 
values, not only binary values. Overall, these two matrices suggest that filtering out rare columns 
may be a relevant strategy. 

All the columns of the matrix RMH008 appear at least two times, thus there are no columns 
to be filtered out. While LICHeE finds that only samples R4 and R6 are a combination of more 
leaves of the tumor phylogeny, our algorithm finds that all eight samples are combinations of two 
or more leaves. However, there are similarities to the prediction of LICHeE. For example, both 
samples R4 and R6 have mutations in common with R5 and R7 (R6_2 = R5_1 = R7_1, (R1_3 = ) 
R4_3 = R5_2 = R7_2). Our subclones R6_2 and R4_3 also correspond to the subclones R6_dom and 
R4_min, respectively, found in [^. Samples R4 and R6 also have mutations in common with R1 
(R1_3 = R6_3 = R4_3), and with R2 (nodes R2_2 and R1_3 = R6_3 = R4_3 are both siblings of a 
node at distance 4 from the root). The subclones R6_3 and R4_3 also correspond to the subclones 
R6_min and R4_dom, respectively, found in [^. 

In matrix RK26, 6 columns are unique, and they have been filtered out in Fig. 6. LICHeE 
finds that only sample R5 is a combination of more leaves, while our algorithm again finds that all 
samples are combinations of two or more leaves. However, there are again similarities in the results. 
Sample R5 has mutations shared with R6, R7, R8 (nodes R5_2 and R6_1 = R7_1 = R8_1 are siblings 
of a node at depth 4). Subclone R5_2 also corresponds to subclone R5_dom found in [^. Sample R5 
also has mutations shared with R9 and R10 (nodes R5_1, R10_1 and R9_1 have the lowest common 
ancestor at depth 5). Subclone R5_1 also corresponds to subclone R5_min found in [^. 

9 Conclusion 


In this paper we showed hardness of the Minimum Conflict-Free Row Split and the Minimum 
Distinct Conflict-Free Row Split problems, and gave a polynomial time algorithm for the 
Minimum Conflict-Free Row Split problem on instances such that no column is contained 
in both columns of a pair of conflicting columns. More general tractable instances could be found 
by inspecting further dependencies between column containment and conflictness. For example, it 
remains open whether the Minimum Conflict-Free Row Split problem is tractable on matrices 
in which no pair of conflicting columns is contained in both columns of a pair of conflicting columns. 
It would also be interesting to identify polynomially solvable cases of the Minimum Distinct 
Conflict-Free Row Split problem and to explore variations of the problems in which we are 
also allowed to edit the entries of the input matrix. 

In the paper we also gave a polynomial time heuristic algorithm for the Minimum Conflict- 
Free Row Split problem based on graph coloring. We leave as a question for future research 
to determine the (in-)approximability status of the optimization variants of the two problems. 
Finally, we remark that in it was assumed that the matrices have no duplicated columns, 
which was not necessary in this paper. 